{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "CD2QfkE-OV_q",
    "papermill": {
     "duration": 0.006583,
     "end_time": "2021-02-24T00:00:00.388859",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.382276",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Analyzing RCT with Precision by Adjusting for Baseline Covariates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pCj5gpZAOap8"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import statsmodels.formula.api as smf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wCK49jmFOV_y",
    "papermill": {
     "duration": 0.005433,
     "end_time": "2021-02-24T00:00:00.399979",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.394546",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Jonathan Roth's DGP\n",
    "\n",
    "Here we set up a DGP with heterogenous effects. In this example, with is due to Jonathan Roth, we have\n",
    "$$\n",
    "E [Y(0) | Z] = - Z, \\quad E [Y(1) |Z] = Z, \\quad Z \\sim N(0,1).\n",
    "$$\n",
    "The CATE is\n",
    "$$\n",
    "E [Y(1) - Y(0) | Z ]= 2 Z.\n",
    "$$\n",
    "and the ATE is\n",
    "$$\n",
    "2 E Z = 0.\n",
    "$$\n",
    "\n",
    "We would like to estimate ATE as precisely as possible.\n",
    "\n",
    "An economic motivation for this example could be provided as follows: Let D be the treatment of going to college, and let $Z$ be academic skills.  Suppose that academic skills cause lower earnings Y(0) in jobs that don't require a college degree, and cause higher earnings  Y(1) in jobs that require college degrees. This type of scenario is reflected in the DGP set-up above.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "2NSMUY6qOV_0",
    "papermill": {
     "duration": 0.182998,
     "end_time": "2021-02-24T00:00:00.589410",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.406412",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "def gen_data(random_seed):\n",
    "    np.random.seed(random_seed)\n",
    "    n = 1000             # sample size\n",
    "    Z = np.random.normal(size=n)         # generate Z\n",
    "    Y0 = -Z + np.random.normal(0, 0.1, size=n)   # conditional average baseline response is -Z\n",
    "    Y1 = Z + np.random.normal(0, 0.1, size=n)    # conditional average treatment effect is +Z\n",
    "    D = np.random.binomial(1, .2, size=n)    # treatment indicator; only 20% get treated\n",
    "    Y = Y1 * D + Y0 * (1 - D)  # observed Y\n",
    "    data = pd.DataFrame({\"Y\": Y, \"D\": D, \"Z\": 1 + Z})  # we artificially add an intercept to the covariates\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "IEYPeB5tOV_1",
    "papermill": {
     "duration": 0.005616,
     "end_time": "2021-02-24T00:00:00.601042",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.595426",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Analyze the RCT data with Precision Adjustment\n",
    "\n",
    "Consider\n",
    "\n",
    "*  classical 2-sample approach, no adjustment (CL)\n",
    "*  classical linear regression adjustment (CRA)\n",
    "*  interactive regression adjusment (IRA)\n",
    "\n",
    "Carry out inference using robust inference, using the sandwich formulas (Eicker-Huber-White).  \n",
    "\n",
    "Observe that CRA delivers estimates that are less efficient than CL (pointed out by Freedman), whereas IRA delivers estimates that are more efficient (pointed out by Lin). In order for CRA to be more efficient than CL, we need the linear model to be a correct model of the conditional expectation function of Y given D and X, which is not the case here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = gen_data(123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "xfV4PjpzOV_1",
    "papermill": {
     "duration": 0.28228,
     "end_time": "2021-02-24T00:00:00.888966",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.606686",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "CL = smf.ols(\"Y ~ D\", data=data).fit()\n",
    "# we are interested in the coefficients on variable \"D\".\n",
    "CL.get_robustcov_results(cov_type=\"HC0\").summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CRA = smf.ols(\"Y ~ D + Z\", data=data).fit()      # classical\n",
    "CRA.get_robustcov_results(cov_type=\"HC0\").summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if we demean the covariates, then the intercept can be interpreted\n",
    "# as an estimate of the expected outcome under control\n",
    "data['Zdemean'] = data['Z'] - data['Z'].mean(axis=0)\n",
    "CRA = smf.ols(\"Y ~ D + Zdemean\", data=data).fit()\n",
    "CRA.get_robustcov_results(cov_type=\"HC0\").summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# However, then we need to correct the standard error associated\n",
    "# with the intercept, to account for the variance in estimating the means.\n",
    "# The standard error for D does not need any correction\n",
    "J = np.mean(1 - data['D'])\n",
    "# the HC0 standard error for the intercept is the second moment of the following score quantity\n",
    "score = CRA.resid * (1 - data['D']) / J\n",
    "# however, now we need to add a correction to the score to account for the\n",
    "# error in the means\n",
    "score += data[['Zdemean']] @ CRA.params[['Zdemean']]\n",
    "print(f\"Corrected stderr['Intercept']: {np.sqrt(np.mean(score**2) / len(data)):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for the interactive approach, we need to demean the covariates Z to interpret\n",
    "# the coefficient of D as the ATE\n",
    "IRA = smf.ols(\"Y ~ D + Zdemean + Zdemean*D\", data=data).fit()  # interactive approach\n",
    "IRA.get_robustcov_results(cov_type=\"HC1\").summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# However, in the interactive approach we also need to correct\n",
    "# the standard error associated with D, to account for the estimation of the means\n",
    "correction = np.var(data[['Zdemean']].values @ IRA.params[['Zdemean:D']]) / len(data)\n",
    "print(f\"Corrected stderr['D']: {np.sqrt(IRA.HC0_se['D']**2 + correction):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# And as before we need to correct the standard error associated\n",
    "# with the intercept, to account for the variance in estimating the means.\n",
    "J = np.mean(1 - data['D'])\n",
    "score = (IRA.resid * (1 - data['D']) + J * data[['Zdemean']] @ IRA.params[['Zdemean']]) / J\n",
    "print(f\"Corrected stderr['Intercept']: {np.sqrt(np.mean(score**2) / len(data)):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "2wHiWkMXOV_3",
    "papermill": {
     "duration": 0.007983,
     "end_time": "2021-02-24T00:00:00.905362",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.897379",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Using classical standard errors (non-robust) is misleading here.\n",
    "\n",
    "We don't teach non-robust standard errors in econometrics courses, but the default statistical inference for the `fit` procedure in python, `smf.ols()`, still uses 100 year old concepts, perhaps in part due to historical legacy.  \n",
    "\n",
    "Here the non-robust standard errors suggest that there is not much difference between the different approaches, contrary to the conclusions reached using the robust standard errors.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Ukoh8YM9OV_3",
    "papermill": {
     "duration": 0.041385,
     "end_time": "2021-02-24T00:00:00.954907",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.913522",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "smf.ols(\"Y ~ D\", data).fit().summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "smf.ols(\"Y ~ D + Z\", data).fit().summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "hoT263l7OV_4",
    "papermill": {
     "duration": 0.010058,
     "end_time": "2021-02-24T00:00:00.974929",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.964871",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# Verify Asymptotic Approximations Hold in Finite-Sample Simulation Experiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_execution_state": "idle",
    "_uuid": "051d70d956493feee0c6d64651c6a088724dca2a",
    "id": "QQZG7z_kOV_4",
    "papermill": {
     "duration": 3.55396,
     "end_time": "2021-02-24T00:00:04.538599",
     "exception": false,
     "start_time": "2021-02-24T00:00:00.984639",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from joblib import Parallel, delayed\n",
    "\n",
    "\n",
    "def exp(it):\n",
    "    data = gen_data(it)\n",
    "    data['Zdemean'] = data['Z'] - data['Z'].mean(axis=0)\n",
    "    CL = smf.ols(\"Y ~ D\", data).fit()\n",
    "    CLcoef = CL.params[\"D\"]\n",
    "    CLint = CL.params[\"Intercept\"]\n",
    "    CRA = smf.ols(\"Y ~ D + Zdemean\", data).fit()\n",
    "    CRAcoef = CRA.params[\"D\"]\n",
    "    CRAint = CRA.params[\"Intercept\"]\n",
    "    IRA = smf.ols(\"Y ~ D + Zdemean+ Zdemean*D\", data).fit()\n",
    "    IRAcoef = IRA.params[\"D\"]\n",
    "    IRAint = IRA.params[\"Intercept\"]\n",
    "    return CLcoef, CLint, CRAcoef, CRAint, IRAcoef, IRAint\n",
    "\n",
    "\n",
    "B = 1000\n",
    "res = Parallel(n_jobs=-1, verbose=3)(delayed(exp)(it) for it in range(B))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "mLfkZGwCdBWj"
   },
   "outputs": [],
   "source": [
    "CLcoefs, CLints, CRAcoefs, CRAints, IRAcoefs, IRAints = map(lambda x: np.array(x), zip(*res))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Standard deviations for ATE based on different estimators\")\n",
    "print(\"Two means ATE std: \", np.std(CLcoefs))\n",
    "print(\"Non-interactive ATE std: \", np.std(CRAcoefs))\n",
    "print(\"Interactive ATE std: \", np.std(IRAcoefs))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Standard deviations for Baseline based on different estimators\")\n",
    "print(\"Two means Baseline std: \", np.std(CLints))\n",
    "print(\"Non-interactive Baseline std: \", np.std(CRAints))\n",
    "print(\"Interactive Baseline std: \", np.std(IRAints))"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 7.390855,
   "end_time": "2021-02-24T00:00:04.661485",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-02-23T23:59:57.270630",
   "version": "2.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
